THIS PROJECT WAS NOT MADE WITH THE INTENTION TO PROMOTE OR SUPPORT GAMBLING/SPORTS BETTING

If you have an addiction, seek help.

Gambling Addiction Hotline: 1-800-GAMBLER

Introduction:

Before we dive in, let’s clear something up right away. I do not condone supporting, cheering for, or celebrating the Dallas Cowboys in any way, shape, or form.

With that out of the way, Dak Prescott happens to be the longest-tenured active quarterback for a single team in the NFL (thanks, Google). For our purposes, that means more data—and more data means more accurate predictions.

So, what’s this project all about? It’s about using machine learning to predict a player's stat outcome for a game. In essence, sports betting is just data science in disguise. Our goal here is to see if we can use the same tools—data science and machine learning—to challenge the odds set by Vegas.

How are we going to do this? By building a simple, supervised linear regression model to predict Dak Prescott’s passing yards in an upcoming game against the San Francisco 49ers (Week 8, 2024 || 27 Oct 2024).

Link to the project's GitHub Repository.

Link to download the PDF of my notebook.

The Data:

I was able to get all the data I needed from pro-football-reference.com.

Big thanks to them!

Since our player in this project is a quarterback, I think it's fitting to try to predict Dak's passing yards for this upcoming game.

If you open the CSV for Dak's regular season stats, you'll notice I removed the rushing and receiving stats. Why? 1. They're irrelevant for this project, and 2. There were some junk columns that were causing issues with Pandas.
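For what it's worth, the junk columns could also be dropped in code rather than by editing the CSV. A minimal sketch (the 'Unnamed: 20' name here is hypothetical, standing in for the junk columns pandas auto-names when it hits stray commas):

```python
import pandas as pd

# Toy frame standing in for the raw CSV; 'Unnamed: 20' mimics a junk column
df = pd.DataFrame({'Rk': [1], 'Pass_Yds': [227], 'Unnamed: 20': [float('nan')]})

# Drop any auto-named junk columns before they cause trouble downstream
df = df.loc[:, ~df.columns.str.startswith('Unnamed')]
print(list(df.columns))  # ['Rk', 'Pass_Yds']
```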


import pandas as pd

df = pd.read_csv('DakPrescott.csv')  # Load our data into pandas
df = df.dropna()  # Drop rows with missing values

print(df.head())  # Display our clean data
            

    Output:
    
    Rk     Year       Date   G#  Week   Age Team  Opp   Result   Cmp   Att  \
    0 1.00 2,016.00  9/11/2016 1.00  1.00 23.04  DAL  NYG  L 19-20 25.00 45.00   
    1 2.00 2,016.00  9/18/2016 2.00  2.00 23.05  DAL  WAS  W 27-23 22.00 30.00   
    2 3.00 2,016.00  9/25/2016 3.00  3.00 23.06  DAL  CHI  W 31-17 19.00 24.00   
    3 4.00 2,016.00  10/2/2016 4.00  4.00 23.07  DAL  SFO  W 24-17 23.00 32.00   
    4 5.00 2,016.00  10/9/2016 5.00  5.00 23.07  DAL  CIN  W 28-14 18.00 24.00   
    
    Cmp%  Pass_Yds  Pass_TD  Int  Passer_rating   Sk   Yds   Y/A  AY/A  
    0 55.56    227.00     0.00 0.00          69.40 0.00  0.00  5.04  5.04  
    1 73.33    292.00     0.00 0.00         103.70 4.00 14.00  9.73  9.73  
    2 79.17    248.00     1.00 0.00         123.60 0.00  0.00 10.33 11.17  
    3 71.88    245.00     2.00 0.00         114.70 2.00 11.00  7.66  8.91  
    4 75.00    227.00     1.00 0.00         117.90 1.00  5.00  9.46 10.29
            

In the output, we see the first five rows of our data, which include labels for our columns. However, columns like Date, Result, and Week are examples of categorical data. For this project, we’re focusing on the numerical data.
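In the next step we'll list the numerical columns by hand, but pandas can also pick them out automatically with select_dtypes. A quick sketch with a toy frame standing in for our data:

```python
import pandas as pd

# Toy frame mixing categorical and numerical columns, like our dataset
df = pd.DataFrame({'Date': ['9/11/2016'], 'Opp': ['NYG'], 'Cmp': [25], 'Pass_Yds': [227]})

# select_dtypes keeps only the numeric columns, dropping the categorical ones
numeric = df.select_dtypes(include='number')
print(list(numeric.columns))  # ['Cmp', 'Pass_Yds']
```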


nummerical_cols = ['Age', 'Cmp', 'Att', 'Cmp%', 'Pass_Yds', 'Pass_TD', 'Int', 'Passer_rating', 'Sk', 'Y/A']
features = df[nummerical_cols]
            

We’ve now isolated the numerical columns in the dataset, assigning them to the variable features. These will be the features we use for our model. We can now check out some basic statistics from this data.


pd.set_option('display.float_format', '{:,.2f}'.format) # Format numbers with thousands separators and two decimal places
print(features.describe(percentiles=[0.25, 0.5, 0.75])) # Display stats based on our numerical data
            

    Output:
    
            Age    Cmp    Att   Cmp%  Pass_Yds  Pass_TD    Int  Passer_rating  \
    count 120.00 120.00 120.00 120.00    120.00   120.00 120.00         120.00   
    mean   26.72  22.81  34.14  67.15    258.84     1.75   0.67          98.55   
    std     2.55   6.24   8.85   9.53     84.82     1.27   0.83          26.72   
    min    23.04   4.00   8.00  37.84     37.00     0.00   0.00          30.40   
    25%    24.15  19.00  27.00  61.54    207.75     1.00   0.00          80.77   
    50%    26.13  22.00  33.00  66.67    249.50     2.00   0.00         101.05   
    75%    29.12  27.00  39.00  72.64    295.25     3.00   1.00         116.67   
    max    31.08  42.00  58.00  88.89    502.00     5.00   3.00         158.30   
                
            Sk    Y/A  
    count 120.00 120.00  
    mean    2.09   7.64  
    std     1.66   1.81  
    min     0.00   3.46  
    25%     1.00   6.40  
    50%     2.00   7.69  
    75%     3.00   9.00  
    max     8.00  12.66  
            

In the notebook, I visualize this data using histograms and scatter plots. To reduce clutter, I won’t include them here, but I encourage you to download the PDF of my notebook if you're interested in seeing them.

However, I will show the correlation matrix here as it is directly relevant to what we're doing in this project:

[Image: correlation matrix showing relationships between completions, attempts, passing yards, passer rating, and more.]

Notice how passer rating has a strong correlation with Y/A (Yards per Attempt). This makes perfect sense if you know the Passer Rating formula:

[Images: the variables used in the Passer Rating formula, its component calculations, and the final Passer Rating formula.]

Pretty neat, right?
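If you want to check the math yourself, the NFL passer rating formula is easy to sketch in code. Plugging in Dak's Week 1, 2016 line from the head() output above (25 of 45 for 227 yards, 0 TD, 0 INT) reproduces the 69.4 rating in the table:

```python
def passer_rating(cmp, att, yds, td, ints):
    clamp = lambda x: max(0.0, min(x, 2.375))  # each component is capped at 2.375
    a = clamp((cmp / att - 0.3) * 5)       # completion percentage component
    b = clamp((yds / att - 3) * 0.25)      # yards per attempt component
    c = clamp((td / att) * 20)             # touchdown rate component
    d = clamp(2.375 - (ints / att) * 25)   # interception rate component
    return (a + b + c + d) / 6 * 100

print(f"{passer_rating(25, 45, 227, 0, 0):.1f}")  # 69.4
```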

We should also look at how passing yards correlates with completions (Cmp), attempts (Att), touchdowns (Pass_TD), passer rating, and yards per attempt (Y/A), since these features will heavily influence the model's predictions. Multicollinearity among them could be an issue, but we're keeping this project as simple as possible.
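For reference, the correlation matrix itself is a one-liner in pandas. A minimal sketch using just the five games from the head() output above (the notebook runs this on all of the data):

```python
import pandas as pd

# The five games from the head() output above
sample = pd.DataFrame({
    'Att':      [45, 30, 24, 32, 24],
    'Y/A':      [5.04, 9.73, 10.33, 7.66, 9.46],
    'Pass_Yds': [227, 292, 248, 245, 227],
})

# Pairwise Pearson correlations between every pair of columns
corr = sample.corr()
print(corr['Pass_Yds'].round(2))
```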

The Model:

Now, we need to separate our target variable (Passing Yards) from the rest of the features.


features = features.drop(columns='Pass_Yds') # Removing Pass_Yds column
target = df['Pass_Yds'] # Assigning target to the column of Pass_Yds
            

Now, we have our target and features. Let’s split the data into training and testing sets.


from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2) # Split the data 80/20 into training and testing sets

print(X_train.shape, y_train.shape, X_test.shape, y_test.shape) # Display the shape of our sets.
            

    Output:

    (96, 9) (96,) (24, 9) (24,)
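One caveat before fitting: train_test_split shuffles randomly, so without a fixed random_state the exact games in the test set (and the back-test numbers below) will change from run to run. Passing a seed pins the split, as this toy sketch shows:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # toy features
y = np.arange(10)                 # toy target

# The same seed yields the same split every time
_, X_test_a, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)
_, X_test_b, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)
print((X_test_a == X_test_b).all())  # True
```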
            

Now we can make our model, fit the data, and train the model. Then, we'll back-test it.


from sklearn.linear_model import LinearRegression

linear_reg = LinearRegression() # Create the linear regression model
linear_reg.fit(X_train, y_train) # Fit the model with our training data
linear_pred = linear_reg.predict(X_test) # Generate predictions
            

It's that easy. Now, let’s see how our model performed.


table_data = []

for predicted, actual in zip(linear_pred, y_test.values):
    # Record each test game as a row: prediction, actual result, and the gap between them
    table_data.append({"Predicted": predicted, "Actual": actual, "Difference": actual - predicted})

# Convert the list of dictionaries into a pandas DataFrame
test_data = pd.DataFrame(table_data)
test_stats = test_data.describe(percentiles=[0.25, 0.5, 0.75]) # Summary statistics for the back-test

# Display the data
print(test_data)
print(test_stats)
            

    Output:

    Predicted  Actual  Difference
    0      169.04  168.00       -1.04
    1      494.55  502.00        7.45
    2      276.83  273.00       -3.83
    3      222.53  223.00        0.47
    4      280.76  284.00        3.24
    5      247.56  249.00        1.44
    6      330.06  334.00        3.94
    7      232.07  232.00       -0.07
    8      280.93  272.00       -8.93
    9      305.27  299.00       -6.27
    10     171.60  179.00        7.40
    11     260.90  255.00       -5.90
    12     223.62  212.00      -11.62
    13     345.42  352.00        6.58
    14      76.77  102.00       25.23
    15     258.54  248.00      -10.54
    16     372.32  374.00        1.68
    17     239.04  245.00        5.96
    18     249.23  249.00       -0.23
    19     232.76  227.00       -5.76
    20     151.11  170.00       18.89
    21     325.50  319.00       -6.50
    22     225.55  212.00      -13.55
    23     253.64  247.00       -6.64

           Predicted  Actual  Difference
    count      24.00   24.00       24.00
    mean      259.40  259.46        0.06
    std        82.10   80.25        9.17
    min        76.77  102.00      -13.55
    25%       225.06  220.25       -6.33
    50%       251.43  248.50       -0.15
    75%       287.01  287.75        4.45
    max       494.55  502.00       25.23
            

We can see the mean difference between the predicted and actual values is only 0.06 yards. Not too shabby, though keep in mind that positive and negative errors cancel in that mean; the standard deviation of roughly 9 yards is a better gauge of a typical miss. The errors are also small partly because features like Att and Y/A come from the very game being predicted (Att × Y/A essentially reconstructs passing yards), so a back-test like this flatters the model.
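For a more compact summary than eyeballing the table, standard regression metrics work well. A sketch computing MAE and RMSE by hand, using the first four test rows from the table above:

```python
import numpy as np

# First four rows of the Predicted/Actual table above
predicted = np.array([169.04, 494.55, 276.83, 222.53])
actual    = np.array([168.00, 502.00, 273.00, 223.00])

mae  = np.mean(np.abs(actual - predicted))          # mean absolute error
rmse = np.sqrt(np.mean((actual - predicted) ** 2))  # root mean squared error
print(f"MAE: {mae:.2f}  RMSE: {rmse:.2f}")  # MAE: 3.20  RMSE: 4.23
```

scikit-learn's metrics module provides the same calculations (e.g. mean_absolute_error) if you'd rather not roll your own.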

Results:

Now, we just need to feed the model values for the features. But, where are we going to get those values? We’ll make them up—kind of. Since we trained the model on historical data, we can generate values based on historical data.

I will be using stats from Dak’s past games vs San Francisco and normalizing them against the performance of the 49ers' defense so far this year. We are currently going into week 7 of the NFL, so we will be using data up to week 6 as they haven't played their week 7 game yet.

Once again, data is coming from pro-football-reference.com:


# Dak's Stats vs SF
num_games = 3

dak_cmp_SF = 53/num_games
dak_att_SF = 81/num_games
dak_cmpPer_SF = 65.43
dak_td_SF = 6/num_games
dak_int_SF = 3/num_games
dak_pRate_SF = 98.4
dak_sk_SF = 5/num_games
dak_ydsAtt_SF = 7.8
            

# 49ers' Defensive Stats 2024 (So far)
games = 6

SF_passAlw = (198+257+198+143+189+309)/games
SF_int = 6/games
SF_sks = 16/games
SF_pasTDAlw = 8/games
SF_passCmp = 120/games
SF_passAtt = 195/games
SF_cmpPer = (SF_passCmp/SF_passAtt)*100
SF_yardsAtt = (SF_passAlw/SF_passAtt)
            

# Normalizing stats
norm_cmp = (dak_cmp_SF/SF_passCmp) * dak_cmp_SF
norm_att = (dak_att_SF/SF_passAtt) * dak_att_SF
norm_cmpPer = (dak_cmpPer_SF/SF_cmpPer) * dak_cmpPer_SF
norm_TD = (dak_td_SF/SF_pasTDAlw) * dak_td_SF
norm_int = (dak_int_SF/SF_int) * dak_int_SF
norm_sacks = (dak_sk_SF/SF_sks) * dak_sk_SF
norm_yardsAtt = (dak_ydsAtt_SF/SF_yardsAtt) * dak_ydsAtt_SF
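Each of those lines applies the same pattern, so it could be factored into a small helper; this is just a refactoring sketch, not what the notebook does:

```python
def scale_vs_defense(player_avg, defense_avg):
    # Scale the player's per-game average by the ratio of that average
    # to what the defense allows/records per game, mirroring the lines above
    return (player_avg / defense_avg) * player_avg

# Completions: Dak's 53 over 3 games vs SF, against SF's 120 allowed over 6 games
norm_cmp = scale_vs_defense(53 / 3, 120 / 6)
print(round(norm_cmp, 2))  # 15.61, matching the value fed to the model below
```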
            

Now that that's out of the way, we can see what values will be going into the model to predict Dak's upcoming passing yards.


# On October 27, 2024, Dak will be 31.25 years old
stats_vs_SF = [31.25, norm_cmp, norm_att, norm_cmpPer, norm_TD, norm_int, dak_pRate_SF, norm_sacks, norm_yardsAtt]

test = pd.DataFrame([stats_vs_SF], columns=features.columns) # Reuse the exact feature columns the model was trained on

print(test)
            

    Output:

        Age    Cmp    Att  Cmp%  Pass_TD   Int  Passer_rating    Sk   Y/A
    0 31.25  15.61  22.43 69.57     3.00  1.00          98.40  1.04  9.17
            

We are finally ready to get our prediction.


predict = linear_reg.predict(test) # Plug our normalized values in to get a prediction
pass_yards = predict[0] # Extract the passing yards from the prediction array
print(f'This Model predicts Dak Prescott will pass for {pass_yards:.2f} yards vs SF on 10/27/2024') # Results
            

    Output:

    This Model predicts Dak Prescott will pass for 240.48 yards vs SF on 10/27/2024
            

There it is: our prediction using a machine learning linear regression model—240 passing yards. There’s one more thing we can do: normalize this value against the 49ers' defense's passing yards allowed, and then we can compare.


dak_yds_SF = 632/num_games
norm_passYards = (dak_yds_SF/SF_passAlw) * pass_yards

print(f'Dak AVG Passing Yards vs SF: {dak_yds_SF:.2f}\nSF AVG Passing Yards Allowed in 2024: {SF_passAlw:.2f}\nPredicted Passing Yards by Dak: {pass_yards:.2f}\nNormalized Predicted Pass Yards: {norm_passYards:.2f}')
            

    Output:

    Dak AVG Passing Yards vs SF: 210.67
    SF AVG Passing Yards Allowed in 2024: 215.67
    Predicted Passing Yards by Dak: 240.48
    Normalized Predicted Pass Yards: 234.91
            

Outcome:

At the time of writing this (16 OCT 2024), player prop bets are not posted yet for week 8 games. So, you and I will have to come back and see what Vegas thinks about how Dak will perform.

Conclusion:

This was just a quick project I threw together based on things I've learned from reading and from my Data Science and Analytics class. There's plenty to expand on and build, including trying different types of models. Maybe there will be a part 2!